display(HTML(sales_dist_div))
display(HTML(order_dist_div))
display(HTML(spo_dist_div))Table of contents
Exploratory data analysis
1.1 Revenue decomposition by day of the week
1.2 Revenue decomposition by type of day and season
1.3 Trends
1.4 HCF impact on salesRegression analysis - HCF
2.1 HCF impact - Item A
2.2 HCF impact - Item BRegression analysis - sales
3.1 Understanding Item A sales
3.2 Understanding Item B salesAppendix 1 - Multicollinearity checks
Appendix 2 - Residual plots
1. Exploratory data analysis
1.1 Revenue decomposition by day of the week
Analysis
Weekends show the highest overall demand, but also exhibit the greatest fluctuations in sales (greater spread in boxplots).
Differences in sales between weekends and weekdays appear to be driven by variations in customer traffic, as the distribution of sales per order remains consistent across the week.
Weekdays demand is relatively stable, with overlapping sales distributions across Mondays to Thursdays.
Sales on holidays are comparable to Fridays, suggesting strong demand.
Wednesday holidays (e.g., Christmas Day and New Year’s Day) saw higher sales per order but fewer total orders, possibly reflecting larger group purchases or celebrations.
1.2 Revenue decomposition by type of day and season
rev_decomp_fig.show()Analysis
Summer revenue was driven by a high number of orders, however sales per order remained comparable to other the seasons.
Fall and Winter saw similar customer traffic, but Winter had higher average sales per order.
In Spring, the number of orders increased notably compared to Winter, though sales per order remained relatively stable.
Across all seasons except Summer, holidays experienced higher average sales per order when compared to weekends and Fridays.
Weekends consistently recorded higher sales per order than Fridays, regardless of season.
Recommendations
…
Statistical tests on recommendations
A. Is the average number of orders higher during the Spring/Summer compared to the Fall/Winter?
order_warm = combined_df[(combined_df['season']=='Spring') | (combined_df['season']=='Summer')]['in_store_orders']
order_cold = combined_df[(combined_df['season']=='Fall') | (combined_df['season']=='Winter')]['in_store_orders']# test for equal variance, a ratio under 2 (or above 0.5) is considered acceptable
(order_warm.describe()['std'] / order_cold.describe()['std']).round(2)1.22
# perform two-sample t-test with equal variance
statistic, p_value = ttest_ind(order_warm, order_cold, alternative='greater')
print(f'Test statistic: {statistic:.3f}')
print(f'p-value: {p_value:.3f}')Test statistic: 8.984
p-value: 0.000
There is statistical evidence suggesting the average number of orders is higher during the Spring/Summer compared to the Fall/Winter.
B. Is the average sales per order higher during holiday’s compared to Fridays and Weekends (non-Summer seasons)?
Note: We will focus our test on comparing holidays and weekends. Since weekends already exhibit higher average sales per order than Fridays, a significant difference between holidays and weekends would also imply holidays differ sigificantly from Fridays.
spo_holiday = combined_df[(combined_df['type_of_day']=='Holiday') & (combined_df['season']!='Summer')]['sales_per_order']
spo_weekend = combined_df[(combined_df['type_of_day']=='Weekend') & (combined_df['season']!='Summer')]['sales_per_order']# test for equal variance, a ratio under 2 is considered acceptable
(spo_holiday.describe()['std'] / spo_weekend.describe()['std']).round(2)1.31
# since our sample for holiday's is small, let's perform a Shapiro-Wilk test to check for normality
statistic, p_value = shapiro(spo_holiday)
print(f'Test statistic: {statistic:.3f}')
print(f'p-value: {p_value:.3f}')
if p_value < 0.05:
print('The data is not normally distributed.')
else:
print('The data is normally distributed.')Test statistic: 0.896
p-value: 0.164
The data is normally distributed.
# perform two-sample t-test
statistic, p_value = ttest_ind(spo_holiday, spo_weekend, alternative='greater')
print(f'Test statistic: {statistic:.3f}')
print(f'p-value: {p_value:.3f}')Test statistic: 1.797
p-value: 0.038
There is statistical evidence suggesting the average sales per order is higher on holidays than weekends.
1.3 Trends
weekly_trend.show()
sales_spo.show()Analysis
Revenue during slow period (much of fall and winter) was dragged down by lower foot traffic, though those who visit spend more compared to the summer months.
The period where item A sales overtook item B (late October to late February), coincides with stronger sales per order but fewer total orders. This suggests:
- The business lose a chunk of customers who buys item B (lower dollar amount) during slow periods;
- Item A customers are more stable throughout the year and spend more per transaction.
HCF did not appear to have driven sales in other items.
Recommendations
…
1.4 HCF impact on sales
As HCF happened during the Winter, we’ll compare sales during the HCF days and non-HCF days in the Winter only for a more accurate comparison.
HCF_A_fig.show()
HCF_B_fig.show()Analysis
HCF did not appear to have a strong effect on item A sales. Most days exhibited overlapping distributions, with Saturdays being the only exception.
HCF seems to have suppresssed item B sales, with three of the days exhibiting notably lower sales.
To supplement the graphical analysis above, we’ll perform a regression analysis.
2. Regression analysis - HCF
2.1 HCF impact - Item A
hcf_inference_df = inference_df.copy()
hcf_inference_df = hcf_inference_df[hcf_inference_df['season']=='Winter']
hcf_A_df = hcf_inference_df[['item_A_sales', 'rain', 'snow', 'is_long_weekend', 'is_holiday', 'day_of_week', 'is_HCF']]formula = (
"""
item_A_sales ~
+ rain
+ snow
+ is_long_weekend
+ C(is_holiday)
+ C(day_of_week)
+ C(is_HCF)
"""
)
hcf_A_model = smf.ols(formula, data=hcf_A_df).fit()
print(hcf_A_model.summary()) OLS Regression Results
==============================================================================
Dep. Variable: item_A_sales R-squared: 0.791
Model: OLS Adj. R-squared: 0.761
Method: Least Squares F-statistic: 26.79
Date: Wed, 13 Aug 2025 Prob (F-statistic): 4.32e-22
Time: 23:18:23 Log-Likelihood: -602.93
No. Observations: 90 AIC: 1230.
Df Residuals: 78 BIC: 1260.
Df Model: 11
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 1718.3903 63.038 27.260 0.000 1592.891 1843.889
is_long_weekend[T.True] 103.4549 134.401 0.770 0.444 -164.116 371.026
C(is_holiday)[T.True] 825.4585 124.029 6.655 0.000 578.535 1072.382
C(day_of_week)[T.Tuesday] -87.6350 84.179 -1.041 0.301 -255.223 79.953
C(day_of_week)[T.Wednesday] 56.9839 85.123 0.669 0.505 -112.483 226.451
C(day_of_week)[T.Thursday] 12.3040 84.429 0.146 0.885 -155.781 180.389
C(day_of_week)[T.Friday] 442.7401 85.866 5.156 0.000 271.793 613.687
C(day_of_week)[T.Saturday] 801.0451 86.876 9.221 0.000 628.089 974.001
C(day_of_week)[T.Sunday] 831.5385 83.651 9.941 0.000 665.002 998.075
C(is_HCF)[T.True] 83.0554 53.258 1.559 0.123 -22.973 189.084
rain -2.1743 4.138 -0.525 0.601 -10.413 6.065
snow -36.6697 24.328 -1.507 0.136 -85.103 11.763
==============================================================================
Omnibus: 20.793 Durbin-Watson: 2.276
Prob(Omnibus): 0.000 Jarque-Bera (JB): 29.960
Skew: 1.027 Prob(JB): 3.12e-07
Kurtosis: 4.942 Cond. No. 53.9
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Analysis
The coefficient for
C(is_HCF)[T.True]is not statistically different from 0, indicating there is no strong evidence that item A sales during HCF differed from sales during non-HCF periods, after controlling for other variables. The positive coefficient could be interpreted as weak evidence that HCF did drive some item A, but not by a significant amount.Item A sales during the Winter was still primarily explained by the weekly seasonality and holiday effects.
2.2 HCF impact - Item B
hcf_b_df = hcf_inference_df[['item_B_sales', 'rain', 'snow', 'is_long_weekend', 'is_holiday', 'day_of_week', 'is_HCF']]formula = (
"""
item_B_sales ~
+ rain
+ snow
+ is_long_weekend
+ C(is_holiday)
+ C(day_of_week)
+ C(is_HCF)
"""
)
hcf_B_model = smf.ols(formula, data=hcf_b_df).fit()
print(hcf_B_model.summary()) OLS Regression Results
==============================================================================
Dep. Variable: item_B_sales R-squared: 0.740
Model: OLS Adj. R-squared: 0.703
Method: Least Squares F-statistic: 20.17
Date: Wed, 13 Aug 2025 Prob (F-statistic): 1.57e-18
Time: 23:18:23 Log-Likelihood: -606.43
No. Observations: 90 AIC: 1237.
Df Residuals: 78 BIC: 1267.
Df Model: 11
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 1665.4529 65.538 25.412 0.000 1534.977 1795.929
is_long_weekend[T.True] 419.5671 139.730 3.003 0.004 141.385 697.749
C(is_holiday)[T.True] 415.9336 128.948 3.226 0.002 159.218 672.649
C(day_of_week)[T.Tuesday] -139.2911 87.517 -1.592 0.116 -313.525 34.942
C(day_of_week)[T.Wednesday] -133.0181 88.499 -1.503 0.137 -309.206 43.170
C(day_of_week)[T.Thursday] -48.4912 87.777 -0.552 0.582 -223.242 126.259
C(day_of_week)[T.Friday] 421.4689 89.271 4.721 0.000 243.743 599.195
C(day_of_week)[T.Saturday] 680.5938 90.321 7.535 0.000 500.779 860.409
C(day_of_week)[T.Sunday] 528.6963 86.968 6.079 0.000 355.555 701.837
C(is_HCF)[T.True] -25.1586 55.370 -0.454 0.651 -135.392 85.075
rain -0.8709 4.303 -0.202 0.840 -9.437 7.695
snow -59.4070 25.293 -2.349 0.021 -109.761 -9.053
==============================================================================
Omnibus: 0.036 Durbin-Watson: 1.592
Prob(Omnibus): 0.982 Jarque-Bera (JB): 0.191
Skew: 0.002 Prob(JB): 0.909
Kurtosis: 2.774 Cond. No. 53.9
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Analysis
Similar to item A, the coefficient for
C(is_HCF)[T.True]is not statistically different from 0, indicating there is no strong evidence that item B sales during HCF differed from sales during non-HCF periods, after controlling for other variables.Item B sales during the Winter were also primarily explained by the weekly seasonality and holiday effects, but compared to item A, long weekends and weather conditions also contributed.
3. Regression analysis - sales
3.1 Understanding Item A sales
formula = (
"""
log_A ~
+ rain
+ snow
+ is_long_weekend
+ C(is_holiday)
+ C(season)
+ C(day_of_week)
"""
)
A_model = smf.ols(formula, data=A_sales_df).fit()
print(A_model.summary()) OLS Regression Results
==============================================================================
Dep. Variable: log_A R-squared: 0.765
Model: OLS Adj. R-squared: 0.755
Method: Least Squares F-statistic: 81.22
Date: Wed, 13 Aug 2025 Prob (F-statistic): 1.04e-93
Time: 23:18:23 Log-Likelihood: 296.79
No. Observations: 339 AIC: -565.6
Df Residuals: 325 BIC: -512.0
Df Model: 13
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 7.4555 0.019 395.924 0.000 7.418 7.493
is_long_weekend[T.True] 0.0023 0.026 0.089 0.929 -0.049 0.054
C(is_holiday)[T.True] 0.3111 0.035 8.773 0.000 0.241 0.381
C(season)[T.Spring] 0.0926 0.017 5.528 0.000 0.060 0.126
C(season)[T.Summer] 0.1608 0.016 10.344 0.000 0.130 0.191
C(season)[T.Fall] -0.0198 0.016 -1.265 0.207 -0.051 0.011
C(day_of_week)[T.Tuesday] 0.0078 0.022 0.358 0.720 -0.035 0.051
C(day_of_week)[T.Wednesday] 0.0307 0.022 1.412 0.159 -0.012 0.073
C(day_of_week)[T.Thursday] 0.0130 0.022 0.598 0.550 -0.030 0.056
C(day_of_week)[T.Friday] 0.2332 0.022 10.661 0.000 0.190 0.276
C(day_of_week)[T.Saturday] 0.4024 0.022 18.119 0.000 0.359 0.446
C(day_of_week)[T.Sunday] 0.3527 0.022 16.002 0.000 0.309 0.396
rain -0.0012 0.001 -1.377 0.169 -0.003 0.000
snow -0.0119 0.010 -1.158 0.248 -0.032 0.008
==============================================================================
Omnibus: 24.661 Durbin-Watson: 1.235
Prob(Omnibus): 0.000 Jarque-Bera (JB): 37.402
Skew: 0.500 Prob(JB): 7.56e-09
Kurtosis: 4.283 Cond. No. 64.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Analysis
Item A sales are not strongly correlated with weather conditions (rain or snow), indicating that customer behavior is relatively unaffected by these factors.
Variations in item A sales are primarily driven by seasonality — both in terms of the day of the week and the actual season — as well as whether the day is a holiday.
Model intepretation
exp_coef = pd.DataFrame({
'features': A_model.params.index,
'coeff': np.exp(A_model.params.values).round(2),
'p-value': A_model.pvalues.values.round(3)
})
exp_coef.set_index('features', inplace=True)
exp_coef| coeff | p-value | |
|---|---|---|
| features | ||
| Intercept | 1729.27 | 0.000 |
| is_long_weekend[T.True] | 1.00 | 0.929 |
| C(is_holiday)[T.True] | 1.36 | 0.000 |
| C(season)[T.Spring] | 1.10 | 0.000 |
| C(season)[T.Summer] | 1.17 | 0.000 |
| C(season)[T.Fall] | 0.98 | 0.207 |
| C(day_of_week)[T.Tuesday] | 1.01 | 0.720 |
| C(day_of_week)[T.Wednesday] | 1.03 | 0.159 |
| C(day_of_week)[T.Thursday] | 1.01 | 0.550 |
| C(day_of_week)[T.Friday] | 1.26 | 0.000 |
| C(day_of_week)[T.Saturday] | 1.50 | 0.000 |
| C(day_of_week)[T.Sunday] | 1.42 | 0.000 |
| rain | 1.00 | 0.169 |
| snow | 0.99 | 0.248 |
Base case: Average item A sales = $1729.27 on a regular Monday in Winter with no rain and no snow.
is_long_weekend - No significant impact on sales.
is_holiday - All else being equal, on average, sales during on a holiday is roughly 36% higher than the base case.
season - No significant difference in item A sales between Winter and Fall.
- Compared to Winter, and all else being equal:
- Spring sales are 10% higher
- Summer sales are 17% higher
day_of_week - No significant difference in item A sales between the weekdays (Mon - Thurs).
- Relative to Monday, and all else being equal:
- Friday sales are 26% higher
- Saturday sales are 50% higher
- Sunday sales are 42% higher
rain - No significant impact on sales.
snow - No significant impact on sales.
For example:
Average item A sales on a Friday in the Summer with no rain and no snow = $1729.27 x 1.26 x 1.17 = $2549.28
3.2 Understanding Item B sales
B_sales_df = inference_df[['log_B', 'hours_opened', 'avg_temperature', 'rain', 'snow',
'is_long_weekend', 'HCF_sales', 'is_holiday', 'season', 'day_of_week']]formula = (
"""
log_B ~
+ rain
+ snow
+ is_long_weekend
+ C(is_holiday)
+ C(season)
+ C(day_of_week)
"""
)
B_model = smf.ols(formula, data=B_sales_df).fit()
print(B_model.summary()) OLS Regression Results
==============================================================================
Dep. Variable: log_B R-squared: 0.848
Model: OLS Adj. R-squared: 0.842
Method: Least Squares F-statistic: 140.0
Date: Wed, 13 Aug 2025 Prob (F-statistic): 1.59e-124
Time: 23:18:23 Log-Likelihood: 259.21
No. Observations: 339 AIC: -490.4
Df Residuals: 325 BIC: -436.9
Df Model: 13
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 7.3495 0.021 349.337 0.000 7.308 7.391
is_long_weekend[T.True] 0.0977 0.029 3.319 0.001 0.040 0.156
C(is_holiday)[T.True] 0.2691 0.040 6.792 0.000 0.191 0.347
C(season)[T.Spring] 0.1546 0.019 8.261 0.000 0.118 0.191
C(season)[T.Summer] 0.4366 0.017 25.145 0.000 0.402 0.471
C(season)[T.Fall] 0.0472 0.017 2.703 0.007 0.013 0.082
C(day_of_week)[T.Tuesday] -0.0057 0.024 -0.234 0.815 -0.054 0.042
C(day_of_week)[T.Wednesday] 0.0287 0.024 1.185 0.237 -0.019 0.076
C(day_of_week)[T.Thursday] 0.0447 0.024 1.842 0.066 -0.003 0.092
C(day_of_week)[T.Friday] 0.3131 0.024 12.807 0.000 0.265 0.361
C(day_of_week)[T.Saturday] 0.4406 0.025 17.760 0.000 0.392 0.489
C(day_of_week)[T.Sunday] 0.3849 0.025 15.633 0.000 0.336 0.433
rain -0.0064 0.001 -6.821 0.000 -0.008 -0.005
snow -0.0341 0.012 -2.966 0.003 -0.057 -0.011
==============================================================================
Omnibus: 5.666 Durbin-Watson: 1.408
Prob(Omnibus): 0.059 Jarque-Bera (JB): 7.178
Skew: -0.127 Prob(JB): 0.0276
Kurtosis: 3.666 Cond. No. 64.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Analysis
Item B sales is particularly weak during the Winter.
Item B sales are more sensitive to weather conditions, weekly seasonality and long weekends.
Model interpretation
exp_coef = pd.DataFrame({
'features': B_model.params.index,
'coeff': np.exp(B_model.params.values).round(2),
'p-value': B_model.pvalues.values.round(3)
})
exp_coef.set_index('features', inplace=True)
exp_coef| coeff | p-value | |
|---|---|---|
| features | ||
| Intercept | 1555.37 | 0.000 |
| is_long_weekend[T.True] | 1.10 | 0.001 |
| C(is_holiday)[T.True] | 1.31 | 0.000 |
| C(season)[T.Spring] | 1.17 | 0.000 |
| C(season)[T.Summer] | 1.55 | 0.000 |
| C(season)[T.Fall] | 1.05 | 0.007 |
| C(day_of_week)[T.Tuesday] | 0.99 | 0.815 |
| C(day_of_week)[T.Wednesday] | 1.03 | 0.237 |
| C(day_of_week)[T.Thursday] | 1.05 | 0.066 |
| C(day_of_week)[T.Friday] | 1.37 | 0.000 |
| C(day_of_week)[T.Saturday] | 1.55 | 0.000 |
| C(day_of_week)[T.Sunday] | 1.47 | 0.000 |
| rain | 0.99 | 0.000 |
| snow | 0.97 | 0.003 |
Base case: Average item B sales = $1555.37 on a regular Monday in Winter with no rain and no snow.
is_long_weekend - All else being equal, on average, sales are approximately 10% higher on long weekends compared to regular weekends.
is_holiday - All else being equal, on average, sales during on a holiday is roughly 31% higher than the base case.
season - Compared to Winter, and all else being equal: - Spring sales are 17% higher - Summer sales are 55% higher - Fall sales are 5% higher
day_of_week - No significant difference in item B sales between the weekdays (Mon - Thurs).
- Relative to Monday, and all else being equal:
- Friday sales are 37% higher
- Saturday sales are 55% higher
- Sunday sales are 47% higher
rain - Each additional 1mm of rainfall is associated with a 1% decrease in item B sales, all else being equal.
snow - Each additional 1mm of snowfall is associated with a 3% decrease in item B sales, all else being equal.
For example:
Average item B sales on a Friday in the Fall with 3mm of rain and no snow = $1555.37 x 1.37 x 1.05 x 0.99^3 = $2170.95
Understanding item C sales
C_sales_df = inference_df[['log_C', 'hours_opened', 'avg_temperature', 'rain', 'snow',
'is_long_weekend', 'HCF_sales', 'is_holiday', 'season', 'day_of_week']]formula = (
"""
log_C ~
+ rain
+ snow
+ is_long_weekend
+ C(is_holiday)
+ C(season)
+ C(day_of_week)
"""
)
C_model = smf.ols(formula, data=C_sales_df).fit()
print(C_model.summary()) OLS Regression Results
==============================================================================
Dep. Variable: log_C R-squared: 0.605
Model: OLS Adj. R-squared: 0.590
Method: Least Squares F-statistic: 38.36
Date: Wed, 13 Aug 2025 Prob (F-statistic): 8.52e-58
Time: 23:18:23 Log-Likelihood: 468.52
No. Observations: 339 AIC: -909.0
Df Residuals: 325 BIC: -855.5
Df Model: 13
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 7.0347 0.011 619.988 0.000 7.012 7.057
is_long_weekend[T.True] 0.0153 0.016 0.966 0.335 -0.016 0.047
C(is_holiday)[T.True] 0.0949 0.021 4.443 0.000 0.053 0.137
C(season)[T.Spring] 0.0757 0.010 7.501 0.000 0.056 0.096
C(season)[T.Summer] 0.0924 0.009 9.869 0.000 0.074 0.111
C(season)[T.Fall] 0.0074 0.009 0.784 0.434 -0.011 0.026
C(day_of_week)[T.Tuesday] -0.0084 0.013 -0.641 0.522 -0.034 0.017
C(day_of_week)[T.Wednesday] 0.0131 0.013 1.005 0.316 -0.013 0.039
C(day_of_week)[T.Thursday] 0.0136 0.013 1.038 0.300 -0.012 0.039
C(day_of_week)[T.Friday] 0.0864 0.013 6.552 0.000 0.060 0.112
C(day_of_week)[T.Saturday] 0.1502 0.013 11.226 0.000 0.124 0.177
C(day_of_week)[T.Sunday] 0.1238 0.013 9.326 0.000 0.098 0.150
rain -0.0006 0.001 -1.221 0.223 -0.002 0.000
snow -0.0131 0.006 -2.108 0.036 -0.025 -0.001
==============================================================================
Omnibus: 11.795 Durbin-Watson: 1.258
Prob(Omnibus): 0.003 Jarque-Bera (JB): 13.369
Skew: 0.359 Prob(JB): 0.00125
Kurtosis: 3.657 Cond. No. 64.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
exp_coef = pd.DataFrame({
'features': C_model.params.index,
'coeff': np.exp(C_model.params.values).round(2),
'p-values': C_model.pvalues.values.round(3)
})
exp_coef.set_index('features', inplace=True)
exp_coef| coeff | p-values | |
|---|---|---|
| features | ||
| Intercept | 1135.40 | 0.000 |
| is_long_weekend[T.True] | 1.02 | 0.335 |
| C(is_holiday)[T.True] | 1.10 | 0.000 |
| C(season)[T.Spring] | 1.08 | 0.000 |
| C(season)[T.Summer] | 1.10 | 0.000 |
| C(season)[T.Fall] | 1.01 | 0.434 |
| C(day_of_week)[T.Tuesday] | 0.99 | 0.522 |
| C(day_of_week)[T.Wednesday] | 1.01 | 0.316 |
| C(day_of_week)[T.Thursday] | 1.01 | 0.300 |
| C(day_of_week)[T.Friday] | 1.09 | 0.000 |
| C(day_of_week)[T.Saturday] | 1.16 | 0.000 |
| C(day_of_week)[T.Sunday] | 1.13 | 0.000 |
| rain | 1.00 | 0.223 |
| snow | 0.99 | 0.036 |
Base case: Average item C sales = $1135.40 on a regular Monday in Winter with no rain and no snow.
is_long_weekend - No significant impact on sales.
is_holiday - All else being equal, on average, sales during on a holiday is roughly 10% higher than the base case.
season - No significant difference in item C sales between Winter and Fall.
- Compared to Winter, and all else being equal:
- Spring sales are 8% higher
- Summer sales are 10% higher
day_of_week - No significant difference in item C sales between the weekdays (Mon - Thurs).
- Relative to Monday, and all else being equal:
- Friday sales are 9% higher
- Saturday sales are 16% higher
- Sunday sales are 13% higher
rain - No significant impact on sales.
snow - Each additional 1mm of snowfall is associated with a 1% decrease in item C sales, all else being equal.
For example:
Average item C sales on a Sunday in the Spring with no snow = $1135.40 x 1.13 x 1.08 = $1385.64
Appendix 1 - multicollinearity checks
HCF item A / item B models
# check for multicollinearity with VIF - rule of thumb VIF < 5 is good.
X = pd.get_dummies(hcf_A_df, drop_first=True)
X = X.drop(columns=['item_A_sales'])
vif = [variance_inflation_factor(X.values, index) for index, value in enumerate(X.columns)]
vif_df = pd.DataFrame({
'Features': X.columns,
'VIF': vif
})
vif_df| Features | VIF | |
|---|---|---|
| 0 | rain | 1.685095 |
| 1 | snow | 1.172261 |
| 2 | is_long_weekend_True | 1.184665 |
| 3 | is_holiday_True | 1.375031 |
| 4 | day_of_week_Tuesday | 1.169418 |
| 5 | day_of_week_Wednesday | 1.241626 |
| 6 | day_of_week_Thursday | 1.178908 |
| 7 | day_of_week_Friday | 1.258241 |
| 8 | day_of_week_Saturday | 1.258572 |
| 9 | day_of_week_Sunday | 1.186628 |
| 10 | is_HCF_True | 1.656635 |
- No signs of multicollinearity among the existing variables.
Item A / item B sales models
# check for multicollinearity with VIF - rule of thumb VIF < 5 is good.
X = pd.get_dummies(A_sales_df, drop_first=True)
X = X.drop(columns=['log_A'])
vif = [variance_inflation_factor(X, index) for index, value in enumerate(X.columns)]
vif_df = pd.DataFrame({
'Features': X.columns,
'VIF': vif
})
vif_df| Features | VIF | |
|---|---|---|
| 0 | hours_opened | 17.307752 |
| 1 | avg_temperature | 17.779757 |
| 2 | rain | 1.357277 |
| 3 | snow | 1.128092 |
| 4 | HCF_sales | 1.876263 |
| 5 | is_long_weekend_True | 1.644316 |
| 6 | is_holiday_True | 1.542749 |
| 7 | season_Spring | 2.402694 |
| 8 | season_Summer | 7.137513 |
| 9 | season_Fall | 3.488629 |
| 10 | day_of_week_Tuesday | 2.138480 |
| 11 | day_of_week_Wednesday | 2.159496 |
| 12 | day_of_week_Thursday | 2.149954 |
| 13 | day_of_week_Friday | 2.368185 |
| 14 | day_of_week_Saturday | 2.467691 |
| 15 | day_of_week_Sunday | 2.357782 |
- The results indicate there is multicollinearity between
hours_opened,avg_temperatureandseason_Summer. Sincehours_openedis largely explained by the day of the week andavg_temperatureis correlated with seasonal effects, we will drop these two variables to reduce redundancy and improve model stability.
# check for multicollinearity again after dropping 'hours_opened' and 'avg_temperature'
X = pd.get_dummies(A_sales_df, drop_first=True)
X = X.drop(columns=['log_A', 'hours_opened', 'avg_temperature'])
vif = [variance_inflation_factor(X, index) for index, value in enumerate(X.columns)]
vif_df = pd.DataFrame({
'Features': X.columns,
'VIF': vif
})
vif_df| Features | VIF | |
|---|---|---|
| 0 | rain | 1.327447 |
| 1 | snow | 1.115561 |
| 2 | HCF_sales | 1.363251 |
| 3 | is_long_weekend_True | 1.634335 |
| 4 | is_holiday_True | 1.467373 |
| 5 | season_Spring | 1.646864 |
| 6 | season_Summer | 1.853490 |
| 7 | season_Fall | 1.911962 |
| 8 | day_of_week_Tuesday | 1.439529 |
| 9 | day_of_week_Wednesday | 1.457562 |
| 10 | day_of_week_Thursday | 1.431270 |
| 11 | day_of_week_Friday | 1.460691 |
| 12 | day_of_week_Saturday | 1.612651 |
| 13 | day_of_week_Sunday | 1.537483 |
- There are no longer signs of multicollinearity after dropping
hours_openedandavg_temperature.
Appendix 2: regression models - residual plots
HCF - item A model
hcf_A_residuals = hcf_A_model.resid
hcf_A_fitted_values = hcf_A_model.fittedvalues
hcf_A_scatter = px.scatter(x=hcf_A_fitted_values, y=hcf_A_residuals)
hcf_A_scatter.add_hline(
y=0,
line_dash='dash',
line_color='red'
)
hcf_A_scatter.update_layout(
title='Fitted values vs residuals',
yaxis=dict(
title='Residuals'
),
xaxis=dict(
title='Fitted values'
),
width=600,
height=400
)
hcf_A_scatter.show()There is no discernable pattern in the residuals.
HCF - item B model
hcf_B_residuals = hcf_B_model.resid
hcf_B_fitted_values = hcf_B_model.fittedvalues
hcf_B_scatter = px.scatter(x=hcf_B_fitted_values, y=hcf_B_residuals)
hcf_B_scatter.add_hline(
y=0,
line_dash='dash',
line_color='red'
)
hcf_B_scatter.update_layout(
title='Fitted values vs residuals',
yaxis=dict(
title='Residuals'
),
xaxis=dict(
title='Fitted values'
),
width=600,
height=400
)
hcf_B_scatter.show()There is no discernable pattern in the residuals.
Item A model
A_residuals = A_model.resid
A_fitted_values = A_model.fittedvalues
A_scatter = px.scatter(x=A_fitted_values, y=A_residuals)
A_scatter.add_hline(
y=0,
line_dash='dash',
line_color='red'
)
A_scatter.update_layout(
title='Fitted values vs residuals',
yaxis=dict(
title='Residuals'
),
xaxis=dict(
title='Fitted values'
),
width=600,
height=400
)
A_scatter.show()There is no discernable pattern in the residuals.
Item B model
B_residuals = B_model.resid
B_fitted_values = B_model.fittedvalues
B_scatter = px.scatter(x=B_fitted_values, y=B_residuals)
B_scatter.add_hline(
y=0,
line_dash='dash',
line_color='red'
)
B_scatter.update_layout(
title='Fitted values vs residuals',
yaxis=dict(
title='Residuals'
),
xaxis=dict(
title='Fitted values'
),
width=600,
height=400
)
B_scatter.show()A couple bigger residuals for smaller fitted values, but overall no clear pattern.
Item C model
C_residuals = C_model.resid
C_fitted_values = C_model.fittedvalues
C_scatter = px.scatter(x=C_fitted_values, y=C_residuals)
C_scatter.add_hline(
y=0,
line_dash='dash',
line_color='red'
)
C_scatter.update_layout(
title='Fitted values vs residuals',
yaxis=dict(
title='Residuals'
),
xaxis=dict(
title='Fitted values'
),
width=600,
height=400
)
C_scatter.show()A couple bigger residuals for smaller fitted values, but overall no clear pattern.